Abstract

We study the problem of linear and convex aggregation of M estimators of a density with respect 
to the mean squared risk. We provide procedures for linear and convex aggregation and we prove 
oracle inequalities for their risks. We also obtain lower bounds showing that these procedures are rate 
optimal in a minimax sense. As an example, we apply general results to aggregation of multivariate 
kernel density estimators with different bandwidths. We show that linear and convex aggregates 
mimic the kernel oracles in asymptotically exact sense for a large class of kernels including Gaussian, 
Silverman's and Pinsker's ones. We prove that, for Pinsker's kernel, the proposed aggregates are sharp 
asymptotically minimax simultaneously over a large scale of Sobolev classes of densities. Finally, we 
provide simulations demonstrating the performance of the convex aggregation procedure.

1 Introduction

Consider i.i.d. random vectors $X_1, \dots, X_n$ with values in $\mathbb{R}^d$ having an unknown common probability
density $p \in L_2(\mathbb{R}^d)$ that we want to estimate. For an estimator $\hat p$ of $p$ based on the sample
$X^n = (X_1, \dots, X_n)$, define the $L_2$-risk

$$R_n(\hat p, p) = E_p^n \|\hat p - p\|^2,$$

where $E_p^n$ denotes the expectation w.r.t. the distribution $P_p^n$ of $X^n$ and, for a function $g \in L_2(\mathbb{R}^d)$,

$$\|g\| = \Big( \int_{\mathbb{R}^d} g^2(x)\,dx \Big)^{1/2}.$$

Suppose that we have $M \ge 2$ estimators $\hat p_1, \dots, \hat p_M$ of the density $p$ based on the sample $X^n$. The problem
that we study here is to construct a new estimator $\tilde p_n$ of $p$, called an aggregate, which is approximately at
least as good as the best linear or convex combination of $\hat p_1, \dots, \hat p_M$. The problems of linear and convex
aggregation of density estimators under the $L_2$ loss can be stated as follows.

1. Problem (L): linear aggregation. Find a linear aggregate, i.e. an estimator $\tilde p_n^L$ which satisfies

$$R_n(\tilde p_n^L, p) \le \inf_{\lambda \in \mathbb{R}^M} R_n(\hat p_\lambda, p) + \Delta_{n,M}^L \tag{1.1}$$

for every $p$ belonging to a large class of densities $\mathcal{P}$, where

$$\hat p_\lambda = \sum_{j=1}^M \lambda_j \hat p_j, \qquad \lambda = (\lambda_1, \dots, \lambda_M),$$

and $\Delta_{n,M}^L$ is a sufficiently small remainder term that does not depend on $p$.
2. Problem (C): convex aggregation. Find a convex aggregate, i.e. an estimator $\tilde p_n^C$ which satisfies

$$R_n(\tilde p_n^C, p) \le \inf_{\lambda \in H} R_n(\hat p_\lambda, p) + \Delta_{n,M}^C \tag{1.2}$$

for every $p$ belonging to a large class of densities $\mathcal{P}$, where $\Delta_{n,M}^C$ is a sufficiently small remainder
term that does not depend on $p$, and $H$ is a convex compact subset of $\mathbb{R}^M$. We will discuss in more
detail the case $H = \Lambda^M$ where $\Lambda^M$ is a simplex,

$$\Lambda^M = \Big\{ \lambda \in \mathbb{R}^M : \lambda_j \ge 0, \ \sum_{j=1}^M \lambda_j \le 1 \Big\}.$$

Our aim is to find aggregates satisfying (1.1) or (1.2) with the smallest possible remainder terms $\Delta_{n,M}^L$
and $\Delta_{n,M}^C$. These remainder terms characterize the price to pay for aggregation.

Linear and convex aggregates mimic the best linear (respectively, convex) combinations of the initial
estimators. Along with them, one may consider model selection (MS) aggregates that mimic the best
among the initial estimators $\hat p_1, \dots, \hat p_M$. We do not analyze this type of aggregation here.

The study of convergence properties of aggregation methods has been initiated by Nemirovski (2000), 
Catoni (1999, 2004) and Yang (2000). Most of the results were obtained for the regression and Gaussian 
white noise models (see a recent overview in Bunea, Tsybakov and Wegkamp (2004)). Aggregation of 
density estimators has received less attention. The work on this subject is mainly devoted to the MS 
aggregation with the Kullback-Leibler divergence as a loss function [Catoni (1999, 2004), Yang (2000), 
Zhang (2003)], and is based on information-theoretical ideas close to the earlier papers of Barron (1987), 
Li and Barron (1999). Devroye and Lugosi (2001) developed a method of MS aggregation of density
estimators satisfying certain complexity assumptions under the $L_1$ loss.

To our knowledge, linear aggregation of density estimators has not been previously studied. For
convex aggregation, the only paper we are aware of is that of Birgé (2003) where this type of aggregation
under the $L_1$ loss is considered, while we study here the $L_2$ loss. In his setup, Birgé (2003) proves an
inequality which is weaker than (1.2), with the oracle risk on the right hand side multiplied by a constant
which is much larger than 1.

We not only suggest aggregates satisfying the sharp oracle inequalities (1.1), (1.2) but also demonstrate
their optimality. Namely, we introduce the notion of optimal rate of aggregation and show that our
aggregates attain optimal rates. This extends to the density estimation context some results of
Tsybakov (2003), where optimal rates of aggregation for the regression model have been obtained.

The main purpose of aggregation is to improve upon the initial set of estimators $\hat p_1, \dots, \hat p_M$. This
is a general tool that applies to various kinds of estimators satisfying very mild conditions (we only
assume that they are square integrable). Consider, for example, the simplest case when we have only two
estimators ($M = 2$), where $\hat p_1$ is a good parametric density estimator for some fixed regular parametric
family and $\hat p_2$ is a nonparametric density estimator. If the underlying density $p$ belongs to the parametric
family, $\hat p_1$ is perfect: its risk converges with the parametric rate $O(1/n)$. But for densities which are not
in this family it may not converge at all. As for $\hat p_2$, it converges with a slow nonparametric rate even
if the underlying density is within the parametric family. Aggregation (cf. Section 2 below) allows one
to construct procedures that combine the advantages of both $\hat p_1$ and $\hat p_2$: the convex or linear aggregates
converge with the parametric rate $O(1/n)$ if $p$ is within the parametric family, and with a nonparametric
rate otherwise. Aggregation can be used in a similar way in the problem of adaptation to the unknown
smoothness (cf. Sections 5 and 6). In this case the index $j$ of $\hat p_j$ corresponds to a value of the smoothing
parameter, and the adaptive estimators in the oracle or minimax sense can be obtained as linear or convex
aggregates. Of course, there exists a large variety of other methods of adaptation to unknown smoothness.
In the numerical examples that we consider, our aggregates are comparable to benchmarks, and show 
somewhat more stable behavior for densities with highly inhomogeneous smoothness (cf. Section 7) . It is 
important to note that aggregation can be used for adaptation to other characteristics than smoothness, 
for example, to the dimension of the subspace where the data effectively lie, under dimension reduction 
models [cf. Samarov and Tsybakov (2005)]. 

In this paper, we consider only one example of application of our general results to the problem of 
adaptation to the unknown smoothness. Specifically, we deal with aggregation of multivariate kernel 
density estimators with different bandwidths. Here the number $M = M_n$ of the estimators depends on $n$
and satisfies $M_n/n \to 0$ as $n \to \infty$. We show in Corollary 5.1 that linear and convex aggregates mimic
the kernel oracles in a sharp asymptotic sense. This corollary is in the spirit of Stone's (1984) theorem
on asymptotic optimality of cross-validation, but it is more powerful in several aspects because it is 
obtained under weaker conditions on p and covers kernels with unbounded support including Gaussian, 
Silverman's and Pinsker's kernels. Another application of our results is that, for Pinsker's kernel, we 
construct aggregates that are sharp asymptotically minimax simultaneously over a large scale of Sobolev 
classes of densities in the multidimensional case. 

To perform aggregation, we use a sample splitting scheme. The sample $X^n$ is split into two independent
subsamples $X_1^m$ (training sample) and $X_2^\ell$ (validation sample) of sizes $m$ and $\ell$ respectively, where $m + \ell = n$
and usually $m \gg \ell$. The first subsample $X_1^m$ is used to construct estimators $\hat p_j = \hat p_{m,j}$, $j = 1, \dots, M$,
while the second subsample $X_2^\ell$ is used to aggregate them, i.e., to construct $\tilde p_n$ (thus, $\tilde p_n$ is measurable
w.r.t. the whole sample $X^n$). In a first analysis we will not consider sample splitting schemes but rather
deal with a "pure aggregation" framework (as in most of the papers on the subject, cf., e.g., Nemirovski
(2000), Juditsky and Nemirovski (2000) and Tsybakov (2003) for the regression problem) where the
first subsample is frozen. This means that instead of the estimators $\hat p_1, \dots, \hat p_M$ we have fixed functions
$p_1, \dots, p_M$ and that the expectations in oracle inequalities are taken only w.r.t. the second subsample.

This paper is organized as follows. In Section 2 we introduce linear and convex aggregation procedures
and prove that they satisfy oracle inequalities of the type (1.1) and (1.2). Section 3 provides lower bounds
showing optimality of the rates obtained in Section 2. Consequences for averaged aggregates are stated
in Section 4. In Sections 5 and 6 we apply the results of Sections 3 and 4 to aggregation of kernel density
estimators. Section 7 contains a simulation study. Throughout the paper we denote by $c_i$ finite positive
constants.

2 Oracle inequalities for linear and convex aggregates 

In this section, $p_1, \dots, p_M$ are fixed functions, not necessarily probability densities. From now on the
notation $p_\lambda$ for a vector $\lambda = (\lambda_1, \dots, \lambda_M) \in \mathbb{R}^M$ is understood in the following sense:

$$p_\lambda = \sum_{j=1}^M \lambda_j p_j,$$

and, since for any fixed $\lambda \in \mathbb{R}^M$ the function $p_\lambda$ is non-random, we have

$$R_n(p_\lambda, p) = \|p_\lambda - p\|^2.$$

Denote by $\mathcal{P}_0$ the class of all densities on $\mathbb{R}^d$ bounded by a constant $L > 0$:

$$\mathcal{P}_0 = \Big\{ p : \mathbb{R}^d \to \mathbb{R} \ \Big|\ p \ge 0, \ \int_{\mathbb{R}^d} p(x)\,dx = 1, \ \|p\|_\infty \le L \Big\},$$

where $\|\cdot\|_\infty$ stands for the $L_\infty(\mathbb{R}^d)$ norm. The constant $L$ need not be known to the statistician.

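We first give an oracle inequality for linear aggregation. Denote by $\mathcal{L}$ the linear span of $p_1, \dots, p_M$.
Let $\phi_1, \dots, \phi_{M'}$ with $M' \le M$ be an orthonormal basis of $\mathcal{L}$ in $L_2(\mathbb{R}^d)$. Define a linear aggregate

$$\tilde p_n^L(x) = \sum_{j=1}^{M'} \hat\lambda_j^L \phi_j(x), \tag{2.1}$$

where

$$\hat\lambda_j^L = \frac{1}{n} \sum_{i=1}^n \phi_j(X_i).$$

Theorem 2.1 Assume that $p_1, \dots, p_M \in L_2(\mathbb{R}^d)$ and $p \in \mathcal{P}_0$. Then

$$R_n(\tilde p_n^L, p) \le \min_{\lambda \in \mathbb{R}^M} \|p_\lambda - p\|^2 + \frac{LM}{n} \tag{2.2}$$

for any integers $M \ge 2$ and $n \ge 1$.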

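Proof. Consider the projection of $p$ onto $\mathcal{L}$:

$$p^*_{\mathcal{L}} = \arg\min_{p_\lambda \in \mathcal{L}} \|p_\lambda - p\|^2 = \sum_{j=1}^{M'} \lambda_j^* \phi_j,$$

where $\lambda_j^* = (p, \phi_j)$, and $(\cdot,\cdot)$ is the scalar product in $L_2(\mathbb{R}^d)$. Using the Pythagorean theorem we get
that, almost surely,

$$\|\tilde p_n^L - p\|^2 = \sum_{j=1}^{M'} (\hat\lambda_j^L - \lambda_j^*)^2 + \|p^*_{\mathcal{L}} - p\|^2.$$

To finish the proof it suffices to take expectations in the last equation and to note that $E_p^n(\hat\lambda_j^L) = \lambda_j^*$ and

$$E_p^n\big[(\hat\lambda_j^L - \lambda_j^*)^2\big] = \mathrm{Var}(\hat\lambda_j^L) \le \frac{1}{n} \int_{\mathbb{R}^d} \phi_j^2(x) p(x)\,dx \le \frac{L}{n}.$$

Summing over $j = 1, \dots, M'$ gives the remainder $LM'/n \le LM/n$.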
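The construction (2.1) can be sketched numerically as follows. This is a minimal illustration, not the authors' implementation: it assumes $d = 1$, represents the functions $p_j$ by their values on a uniform grid, replaces the $L_2$ inner product by a Riemann sum, and obtains the orthonormal basis from a singular value decomposition. The helper names `orthonormal_basis` and `linear_aggregate`, the Gaussian candidates and the sample size are all hypothetical choices made for the example.

```python
import numpy as np

def orthonormal_basis(P, dx, tol=1e-10):
    """Orthonormalize the rows of P (functions sampled on a uniform grid)
    in the discrete L2 inner product <f, g> = sum(f * g) * dx."""
    # Right singular vectors of the weighted matrix span the rows of P.
    _, s, vt = np.linalg.svd(P * np.sqrt(dx), full_matrices=False)
    keep = s > tol * s.max()          # drop numerically null directions (M' <= M)
    return vt[keep] / np.sqrt(dx)     # rows phi_j with sum(phi_j**2) * dx == 1

def linear_aggregate(P, grid, sample):
    """Values on the grid of sum_j lambda_hat_j * phi_j, as in (2.1)."""
    dx = grid[1] - grid[0]
    phi = orthonormal_basis(P, dx)
    # lambda_hat_j = (1/n) * sum_i phi_j(X_i); phi_j is interpolated at the X_i
    lam = np.array([np.interp(sample, grid, f).mean() for f in phi])
    return lam @ phi

def gauss(x, mu, s):
    return np.exp(-0.5 * ((x - mu) / s) ** 2) / (s * np.sqrt(2 * np.pi))

# Hypothetical example: three Gaussian candidates, data drawn from the first one.
rng = np.random.default_rng(0)
grid = np.linspace(-10.0, 10.0, 4001)
P = np.stack([gauss(grid, 0.0, 1.0), gauss(grid, 1.0, 2.0), gauss(grid, -2.0, 0.5)])
sample = rng.standard_normal(20000)
agg = linear_aggregate(P, grid, sample)
err = np.sum((agg - P[0]) ** 2) * (grid[1] - grid[0])   # squared L2 error on the grid
```

Since the true density lies in the span of the candidates here, the oracle term in (2.2) vanishes and the squared error `err` is driven only by the variance of the estimated coefficients.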



Consider now convex aggregation. Its aim is to mimic the convex oracle defined as
$\lambda^* = \arg\min_{\lambda \in H} \|p_\lambda - p\|^2$ where $H$ is a given convex compact subset of $\mathbb{R}^M$. Clearly,

$$\|p_\lambda - p\|^2 = \|p_\lambda\|^2 - 2 \int_{\mathbb{R}^d} p_\lambda p + \|p\|^2.$$

Removing here the term $\|p\|^2$ independent of $\lambda$ and estimating $\int_{\mathbb{R}^d} p_j p$ by $n^{-1} \sum_{i=1}^n p_j(X_i)$ we get the
following estimate of the oracle

$$\hat\lambda^C = \arg\min_{\lambda \in H} \Big\{ \|p_\lambda\|^2 - \frac{2}{n} \sum_{i=1}^n p_\lambda(X_i) \Big\}. \tag{2.3}$$

Now, we define a convex aggregate $\tilde p_n^C$ by

$$\tilde p_n^C = \sum_{j=1}^M \hat\lambda_j^C p_j = p_{\hat\lambda^C}.$$

Theorem 2.2 Let $H$ be a convex compact subset of $\mathbb{R}^M$. Assume that $p_1, \dots, p_M \in L_2(\mathbb{R}^d)$ and $p \in \mathcal{P}_0$.
Then the convex aggregate $\tilde p_n^C$ satisfies

$$R_n(\tilde p_n^C, p) \le \min_{\lambda \in H} \|p_\lambda - p\|^2 + \frac{4LM}{n} \tag{2.4}$$

for any integers $M \ge 2$ and $n \ge 1$.
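For $H = \Lambda^M$ the criterion in (2.3) is the quadratic form $\lambda^T G \lambda - 2 b^T \lambda$ with $G_{jk} = \int p_j p_k$ and $b_j = n^{-1} \sum_{i=1}^n p_j(X_i)$, so the weights can be computed by any convex quadratic solver. The sketch below uses projected gradient descent, which is one possible choice and not a method prescribed by the text; the function names, the iteration count and the step size rule are assumptions made for the illustration.

```python
import numpy as np

def project_simplex_leq(v):
    """Euclidean projection onto {lam >= 0, sum(lam) <= 1}."""
    w = np.maximum(v, 0.0)
    if w.sum() <= 1.0:
        return w
    # Otherwise the constraint sum(lam) = 1 is active: use the sort-based rule.
    u = np.sort(v)[::-1]
    css = np.cumsum(u) - 1.0
    k = np.arange(1, len(v) + 1)
    rho = np.nonzero(u - css / k > 0)[0][-1]
    return np.maximum(v - css[rho] / (rho + 1), 0.0)

def convex_weights(G, b, n_iter=5000):
    """Minimize lam' G lam - 2 b' lam over the simplex by projected gradient."""
    step = 1.0 / (2.0 * np.linalg.eigvalsh(G).max())   # 1 / Lipschitz constant of the gradient
    lam = np.full(len(b), 1.0 / len(b))
    for _ in range(n_iter):
        lam = project_simplex_leq(lam - step * 2.0 * (G @ lam - b))
    return lam
```

With `G` the Gram matrix of the candidates and `b` their empirical means over the validation sample, `convex_weights(G, b)` approximates $\hat\lambda^C$; the fixed iteration count may be too small when `G` is badly conditioned.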

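Proof. We will write for brevity $\hat\lambda = \hat\lambda^C$. First note that the mapping $\lambda \mapsto \|p_\lambda\|^2 - \frac{2}{n} \sum_{i=1}^n p_\lambda(X_i)$
is continuous, thus $\hat\lambda$ exists, and the oracle $\lambda^* = \arg\min_{\lambda \in H} \|p_\lambda - p\|^2$ also exists. The definition of $\hat\lambda$
implies that, for any $p \in \mathcal{P}_0$,

$$\|p_{\hat\lambda} - p\|^2 \le \|p_{\lambda^*} - p\|^2 + 2 T_n \tag{2.5}$$

where

$$T_n = \frac{1}{n} \sum_{i=1}^n p_{\hat\lambda - \lambda^*}(X_i) - \int_{\mathbb{R}^d} p_{\hat\lambda - \lambda^*}\, p.$$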



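Introduce the notation

$$Z_n = \sup_{\mu \in \mathbb{R}^M :\ \|p_\mu\| \ne 0} \frac{\big| \frac{1}{n} \sum_{i=1}^n p_\mu(X_i) - \int_{\mathbb{R}^d} p_\mu\, p \big|}{\|p_\mu\|}.$$

Using the Cauchy-Schwarz inequality, the identity $p_{\hat\lambda - \lambda^*} = p_{\hat\lambda} - p_{\lambda^*}$ and the elementary inequality
$2\sqrt{xy} \le ax + y/a$, $\forall x, y, a > 0$, we get

$$E_p^n |T_n| \le E_p^n \big( Z_n \|p_{\hat\lambda} - p_{\lambda^*}\| \big) \le \frac{a}{2} E_p^n \big( \|p_{\hat\lambda} - p_{\lambda^*}\|^2 \big) + \frac{1}{2a} E_p^n (Z_n^2), \quad \forall a > 0. \tag{2.6}$$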

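Representing $p_\mu$ in the form $p_\mu = \sum_{j=1}^{M'} \nu_j \phi_j$ where $\nu_j \in \mathbb{R}$ and $\{\phi_j\}$ is an orthonormal basis in $\mathcal{L}$ (cf.
proof of Theorem 2.1) we find

$$Z_n = \sup_{\nu \in \mathbb{R}^{M'} \setminus \{0\}} \frac{\big| \sum_{j=1}^{M'} \nu_j \zeta_j \big|}{|\nu|} = \Big( \sum_{j=1}^{M'} \zeta_j^2 \Big)^{1/2},$$

where $|\nu| = \big( \sum_{j=1}^{M'} \nu_j^2 \big)^{1/2}$ and

$$\zeta_j = \frac{1}{n} \sum_{i=1}^n \phi_j(X_i) - \int_{\mathbb{R}^d} \phi_j\, p.$$

Hence

$$E_p^n (Z_n^2) \le \sum_{j=1}^{M'} E_p^n (\zeta_j^2) \le \frac{LM'}{n} \le \frac{LM}{n} \tag{2.7}$$

whenever $\|p\|_\infty \le L$. Since $\{p_\lambda : \lambda \in H\}$ is a convex subset of $L_2(\mathbb{R}^d)$ and $p_{\lambda^*}$ is the projection of $p$ onto
this set, we have

$$\|p_\lambda - p\|^2 \ge \|p_{\lambda^*} - p\|^2 + \|p_\lambda - p_{\lambda^*}\|^2, \quad \forall \lambda \in H, \ p \in L_2(\mathbb{R}^d). \tag{2.8}$$

Using (2.8) with $\lambda = \hat\lambda$, (2.6) and (2.7) we obtain

$$E_p^n |T_n| \le \frac{a}{2} \Big\{ E_p^n \big( \|p_{\hat\lambda} - p\|^2 - \|p_{\lambda^*} - p\|^2 \big) \Big\} + \frac{LM}{2an}.$$

This and (2.5) yield that, for any $0 < a < 1$,

$$E_p^n \big( \|p_{\hat\lambda} - p\|^2 \big) \le \|p_{\lambda^*} - p\|^2 + \frac{LM}{a(1-a)n}.$$

Now, (2.4) follows by taking the infimum of the right hand side of this inequality over $0 < a < 1$.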
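For completeness, the last step can be written out explicitly. The display below is a short worked computation added for illustration; it only uses the fact that $a(1-a)$ is maximal at $a = 1/2$.

```latex
\inf_{0<a<1} \frac{LM}{a(1-a)\,n}
  = \frac{LM}{n} \cdot \frac{1}{\max_{0<a<1} a(1-a)}
  = \frac{LM}{n} \cdot \frac{1}{1/4}
  = \frac{4LM}{n},
\qquad \text{attained at } a = \tfrac12 .
```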

3 Lower bounds and optimal aggregation 

We first define the notion of optimal rate of aggregation for density estimation, similar to that for the 
regression problem given in Tsybakov (2003). It is related to the minimax behavior of the excess risk 

$$\mathcal{E}(\tilde p_n, p, H) = R_n(\tilde p_n, p) - \inf_{\lambda \in H} \|p_\lambda - p\|^2$$

for a given class $H$ of weights $\lambda$.

Definition 3.1 Let $\mathcal{P}$ be a given class of probability densities on $\mathbb{R}^d$, and let $H \subseteq \mathbb{R}^M$ be a given class
of weights. A sequence of positive numbers $\psi_n(M)$ is called optimal rate of aggregation for $H$ over
$\mathcal{P}$ if

• for any functions $p_j \in L_2(\mathbb{R}^d)$, $j = 1, \dots, M$, there exists an estimator $\tilde p_n$ of $p$ (aggregate) such that

$$\sup_{p \in \mathcal{P}} \Big[ R_n(\tilde p_n, p) - \inf_{\lambda \in H} \|p_\lambda - p\|^2 \Big] \le C \psi_n(M) \tag{3.1}$$

for any integer $n \ge 1$ and for some constant $C < \infty$ independent of $M$ and $n$,

and

• there exist functions $p_j \in L_2(\mathbb{R}^d)$, $j = 1, \dots, M$, such that for all estimators $T_n$ of $p$, we have

$$\sup_{p \in \mathcal{P}} \Big[ R_n(T_n, p) - \inf_{\lambda \in H} \|p_\lambda - p\|^2 \Big] \ge c \psi_n(M) \tag{3.2}$$

for any integer $n \ge 1$ and for some constant $c > 0$ independent of $M$ and $n$.

When (3.2) holds, an aggregate $\tilde p_n$ satisfying (3.1) is called rate optimal aggregate for $H$ over $\mathcal{P}$.

Note that this definition applies to aggregation of any functions $p_j$ in $L_2(\mathbb{R}^d)$; they are not necessarily
supposed to be probability densities.

Theorems 2.1 and 2.2 provide upper bounds of the type (3.1) with the rate $\psi_n(M) = LM/n$ for linear
and convex aggregates $\tilde p_n = \tilde p_n^L$ and $\tilde p_n = \tilde p_n^C$ when $\mathcal{P} = \mathcal{P}_0$ and $H = \mathbb{R}^M$ or $H$ is a convex compact
subset of $\mathbb{R}^M$. In this section we complement these results by lower bounds of the type (3.2) showing
that $\psi_n(M) = LM/n$ is the optimal rate of linear and convex aggregation. The proofs will be based on the
following lemma which is adapted from Corollary 4.1 of Birgé (1986), p. 281.

Lemma 3.1 Let $\mathcal{C}$ be a set of functions of the following type

$$\mathcal{C} = \Big\{ f + \sum_{i=1}^r \delta_i g_i, \quad \delta_i \in \{0, 1\}, \ i = 1, \dots, r \Big\},$$

where the $g_i$ are functions on $\mathbb{R}^d$ with disjoint supports, such that $\int g_i(x)\,dx = 0$, $f$ is a probability density
on $\mathbb{R}^d$ which is constant on the union of the supports of the $g_i$'s, and $f + g_i \ge 0$ for all $i$. Assume that

$$\min_{1 \le i \le r} \|g_i\|^2 \ge \alpha > 0 \quad \text{and} \quad \max_{1 \le i \le r} h^2(f, f + g_i) \le \beta < 1, \tag{3.3}$$

where $h^2(f, g) = (1/2) \int \big( \sqrt{f(x)} - \sqrt{g(x)} \big)^2 dx$ is the squared Hellinger distance between two probability
densities $f$ and $g$. Then

$$\inf_{T_n} \sup_{p \in \mathcal{C}} R_n(T_n, p) \ge \frac{r\alpha}{4} \big( 1 - \sqrt{2n\beta} \big),$$

where $\inf_{T_n}$ denotes the infimum over all estimators.

Consider first a lower bound for linear aggregation of density estimators. We are going to prove (3.2)
with $\psi_n(M) = LM/n$, $\mathcal{P} = \mathcal{P}_0$ and $H = \mathbb{R}^M$. Note first that for $\mathcal{P} = \mathcal{P}_0$ there is a natural limitation
on the value $c\psi_n(M)$ on the right hand side of (3.2), whatever $H$ is. In fact,

$$\inf_{T_n} \sup_{p \in \mathcal{P}_0} \Big[ R_n(T_n, p) - \inf_{\lambda \in H} \|p_\lambda - p\|^2 \Big] \le \inf_{T_n} \sup_{p \in \mathcal{P}_0} R_n(T_n, p) \le \sup_{p \in \mathcal{P}_0} R_n(0, p) = \sup_{p \in \mathcal{P}_0} \|p\|^2 \le L.$$

Therefore, we must have $c\psi_n(M) \le L$ where $c$ is the constant in (3.2). For $\psi_n(M) = LM/n$ this means that
only the values $M$ such that $M \le c_0 n$ are allowed, where $c_0 > 0$ is a constant. The upper bounds of
Theorems 2.1 and 2.2 are too rough (non-optimal) when $M = M_n$ depends on $n$ and the condition $M \le c_0 n$
is not satisfied. In the sequel, we will apply those theorems with $M = M_n$ depending on $n$ and satisfying
$M_n/n \to 0$ as $n \to \infty$, so that the condition $M \le c_0 n$ will obviously hold with any finite $c_0$ for $n$ large enough.

Theorem 3.1 Let the integers $M \ge 2$ and $n \ge 1$ be such that $M \le c_0 n$ where $c_0$ is a positive constant.
Then there exist probability densities $p_j \in L_2(\mathbb{R}^d)$, $j = 1, \dots, M$, such that for all estimators $T_n$ of $p$ we
have

$$\inf_{T_n} \sup_{p \in \mathcal{P}_0} \Big[ R_n(T_n, p) - \inf_{\lambda \in \mathbb{R}^M} \|p_\lambda - p\|^2 \Big] \ge c\,\frac{LM}{n} \tag{3.4}$$

where $c > 0$ is a constant depending only on $c_0$.

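Proof. Set $r = M - 1 \ge 1$ and fix $0 < a < 1$. Consider the function $g$ defined for any $t \in \mathbb{R}$ by

$$g(t) = \frac{aL}{2} \Big[ 1_{[0,\,1/(Lr)]}(t) - 1_{(1/(Lr),\,2/(Lr)]}(t) \Big],$$

where $1_A(\cdot)$ denotes the indicator function of a set $A$. Let $\{g_j\}_{j=1}^r$ be the family of functions defined
by $g_j(t) = g(t - 2(j-1)/(Lr))$, $1 \le j \le r$. Define also the density $f(t) = (L/2)\, 1_{[0,\,2/L]}(t)$, $t \in \mathbb{R}$. For
$x = (x_1, \dots, x_d) \in \mathbb{R}^d$ consider the functions

$$f(x) = f(x_1) \prod_{k=2}^d 1_{[0,1]}(x_k), \qquad g_j(x) = g_j(x_1) \prod_{k=2}^d 1_{[0,1]}(x_k), \quad j = 1, \dots, r.$$

Define the probability densities $p_j$ by $p_1 = f$, $p_{j+1} = f + g_j$, $j = 1, \dots, M - 1$.

Consider now the set of functions $\mathcal{Q} = \{ q_\delta : q_\delta = f + \sum_{i=1}^r \delta_i g_i;\ \delta = (\delta_1, \dots, \delta_r) \in \{0,1\}^r \}$. Clearly,
for any $\delta \in \{0,1\}^r$, $q_\delta$ satisfies $\int_{\mathbb{R}^d} q_\delta(x)\,dx = 1$, $q_\delta \ge 0$ and $\|q_\delta\|_\infty \le L$. Therefore $\mathcal{Q} \subset \mathcal{P}_0$. Also,
$\mathcal{Q} \subset \{p_\lambda, \lambda \in \mathbb{R}^M\}$. Thus,

$$\inf_{T_n} \sup_{p \in \mathcal{P}_0} \Big[ R_n(T_n, p) - \inf_{\lambda \in \mathbb{R}^M} \|p_\lambda - p\|^2 \Big] \ge \inf_{T_n} \sup_{p \in \mathcal{Q}} R_n(T_n, p).$$

To prove that $\inf_{T_n} \sup_{p \in \mathcal{Q}} R_n(T_n, p) \ge cLM/n$ we check conditions (3.3) of Lemma 3.1. The first
condition in (3.3) is obviously satisfied since

$$\|g_j\|^2 = \int_{\mathbb{R}} g^2(t)\,dt = \frac{a^2 L}{2r}, \quad j = 1, \dots, r.$$

To check the second condition in (3.3), note that for $j = 1, \dots, r$ we have

$$h^2(f, f + g_j) = \frac{1}{2} \int_0^{2/(Lr)} \Big( \sqrt{L/2} - \sqrt{L/2 + g(t)} \Big)^2 dt = \frac{L}{4} \int_0^{2/(Lr)} \Big( 1 - \sqrt{1 + (2/L) g(t)} \Big)^2 dt$$

$$= \frac{1}{r} - \frac{L}{2} \int_0^{2/(Lr)} \sqrt{1 + (2/L) g(t)}\,dt = \frac{1}{r} - \frac{1}{2r} \big( \sqrt{1+a} + \sqrt{1-a} \big) \le \frac{a^2}{2r},$$

where we used the fact that $\sqrt{1+a} + \sqrt{1-a} \ge 2 - a^2$ for $|a| \le 1$. Define now $\bar c_0 = \max(c_0, 3)$ and choose
$a^2 = M/(\bar c_0 n) \le 1$. Then $a^2/(2r) \le (\bar c_0 n)^{-1}$ for $M \ge 2$. Applying Lemma 3.1 with $\beta = (\bar c_0 n)^{-1}$ and
$\alpha = \dfrac{ML}{2 \bar c_0 n r}$ we get

$$\inf_{T_n} \sup_{p \in \mathcal{Q}} R_n(T_n, p) \ge \frac{1}{8 \bar c_0} \Big( 1 - \sqrt{\frac{2}{\bar c_0}} \Big) \frac{LM}{n}.$$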
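The two conditions of Lemma 3.1 can be checked numerically for the step function used in the proof. The sketch below is only an illustration: it takes $g$ to be the two-step function equal to $aL/2$ on the first half of an interval of length $2/(Lr)$ and to $-aL/2$ on the second half, and evaluates $\|g_j\|^2$ and $h^2(f, f + g_j)$ exactly, since all integrands are piecewise constant. The function name `check_construction` and the parameter values are hypothetical.

```python
import math

def check_construction(L, M, a):
    """Return (||g_j||^2, h^2(f, f + g_j)) for the two-step perturbation."""
    r = M - 1
    half = 1.0 / (L * r)                       # length of each step of g
    norm_sq = 2 * half * (a * L / 2) ** 2      # integral of g^2 over both steps
    base = math.sqrt(L / 2)                    # sqrt of f on the support of g
    hell = 0.5 * half * ((base - math.sqrt(L / 2 * (1 + a))) ** 2
                         + (base - math.sqrt(L / 2 * (1 - a))) ** 2)
    return norm_sq, hell

norm_sq, hell = check_construction(L=2.0, M=5, a=0.5)
```

For every admissible choice of the parameters, `norm_sq` agrees with $a^2 L/(2r)$ and `hell` stays below $a^2/(2r)$, which are the values of $\alpha$ and the bound on $\beta$ used above.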



Theorems 2.1 and 3.1 imply the following result.

Corollary 3.1 Let the integers $M \ge 2$ and $n \ge 1$ be such that $M \le c_0 n$ where $c_0$ is a positive constant.
Then $\psi_n(M) = LM/n$ is optimal rate of linear aggregation over $\mathcal{P}_0$ (i.e. the optimal rate of aggregation
for $H = \mathbb{R}^M$ over $\mathcal{P}_0$), and $\tilde p_n^L$ defined in (2.1) is rate optimal aggregate for $H = \mathbb{R}^M$ over $\mathcal{P}_0$.

Consider now a lower bound for convex aggregation. We analyze here only the case $H = \Lambda^M$. Other
examples of convex sets $H$ can be treated similarly.

Theorem 3.2 Let the integers $M \ge 2$ and $n \ge 1$ be such that $M \le c_0 n$. Then there exist functions
$p_j \in L_2(\mathbb{R}^d)$, $j = 1, \dots, M$, such that for all estimators $T_n$ of $p$ we have

$$\inf_{T_n} \sup_{p \in \mathcal{P}_0} \Big[ R_n(T_n, p) - \inf_{\lambda \in \Lambda^M} \|p_\lambda - p\|^2 \Big] \ge c\,\frac{LM}{n} \tag{3.5}$$

where $c > 0$ is a constant depending only on $c_0$.

Proof. Consider the same family of densities $\mathcal{Q}$ as defined in the proof of Theorem 3.1. We may rewrite
it in the form $\mathcal{Q} = \{ q_\delta : q_\delta = \lambda_1 M f + \sum_{j=1}^{M-1} \lambda_{j+1} M \delta_j g_j,\ \delta = (\delta_1, \dots, \delta_r) \in \{0,1\}^r \}$ where $\lambda_j = 1/M$,
$j = 1, \dots, M$. Define now $p_1 = Mf$, $p_{j+1} = M(f + g_j)$, $j = 1, \dots, M - 1$. Since $\sum_{j=1}^M \lambda_j = 1$ we have
$\mathcal{Q} \subset \{p_\lambda, \lambda \in \Lambda^M\}$. The rest of the proof is identical to that of Theorem 3.1.



Theorems 2.2 and 3.2 imply the following result.

Corollary 3.2 Let the integers $M \ge 2$ and $n \ge 1$ be such that $M \le c_0 n$. Then $\psi_n(M) = LM/n$ is
optimal rate of convex aggregation over $\mathcal{P}_0$ (i.e. the optimal rate of aggregation for $H = \Lambda^M$ over $\mathcal{P}_0$),
and $\tilde p_n^C$ is rate optimal aggregate for $H = \Lambda^M$ over $\mathcal{P}_0$.

Inspection of the proofs of Theorems 3.2 and 3.1 reveals that the least favorable functions $p_j$ used
in the lower bound for linear aggregation are uniformly bounded by $L$, whereas this is not the case for
the least favorable functions in convex aggregation. It can be shown that, for convex aggregation of functions
which are uniformly bounded by $L$, an elbow appears in the optimal rates of aggregation, with the bound
(3.5) still remaining valid for $M \le \sqrt{n}$. This issue will be treated in a forthcoming paper of the first
author.



4 Sample splitting and averaged aggregates 

We now come back to the original problem discussed in the introduction. Let $X_1^m$ denote a subsample of
$X^n = (X_1, \dots, X_n)$ of size $m < n$ (training sample). Construct estimators $\hat p_{m,1}, \dots, \hat p_{m,M}$
of $p$ based on $X_1^m$. Then aggregate these estimators using the validation subsample $X_2^\ell$ of $X^n$ of size
$\ell = n - m$,

$$(X_1^m, X_2^\ell) = X^n = (X_1, \dots, X_n).$$

For given $m < n$ the two subsamples can be obtained by different splits. The choice of split is arbitrary,
and it may influence the result of estimation. In order to avoid the arbitrariness, we will use a jackknife
type procedure averaging the aggregates over different splits. Define a split $S$ of the initial sample $X^n$
as a mapping

$$S : X^n \mapsto (X_1^m, X_2^\ell).$$

Denote by $X_1^{m,S}$, $X_2^{\ell,S}$ the subsamples obtained for a fixed split $S$ and consider an arbitrary set of splits $\mathcal{S}$. It
can be, for example, the set of all splits. Define $\tilde p_n^S$ as a linear or convex aggregate ($\tilde p_n^L$ or $\tilde p_n^C$ respectively)
based on the validation sample $X_2^{\ell,S}$ and on the initial set of estimators $\hat p_j = \hat p_{m,j}^S$, $j = 1, \dots, M$, where
each of the $\hat p_{m,j}^S$'s is constructed from the training sample $X_1^{m,S}$. Introduce the following averaged aggregate
estimator:

$$\tilde p_n^{\mathcal{S}} = \frac{1}{\mathrm{card}(\mathcal{S})} \sum_{S \in \mathcal{S}} \tilde p_n^S. \tag{4.1}$$

Let $H$ be either $\mathbb{R}^M$ or a convex compact subset of $\mathbb{R}^M$. Define

$$\Delta_{\ell,M} = \begin{cases} LM/\ell & \text{if } H = \mathbb{R}^M, \\ 4LM/\ell & \text{if } H \text{ is a convex compact subset of } \mathbb{R}^M. \end{cases}$$
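The averaging over splits can be organized as in the following sketch. It is an assumption-laden illustration rather than the procedure used in the paper's simulations: the set of splits is taken to be a few random permutations, and `fit_candidates` and `aggregate` are hypothetical callables supplied by the user (the first builds the candidate estimators from a training sample, the second combines them using a validation sample and returns a function of $x$).

```python
import random

def averaged_aggregate(sample, m, n_splits, fit_candidates, aggregate, seed=0):
    """Average the aggregates obtained from n_splits random splits.

    Returns a function x -> averaged estimate at x."""
    rng = random.Random(seed)
    fitted = []
    for _ in range(n_splits):
        idx = list(range(len(sample)))
        rng.shuffle(idx)                               # one random split S
        train = [sample[i] for i in idx[:m]]           # training sample, size m
        valid = [sample[i] for i in idx[m:]]           # validation sample, size n - m
        candidates = fit_candidates(train)             # list of functions
        fitted.append(aggregate(candidates, valid))    # aggregate for this split
    return lambda x: sum(f(x) for f in fitted) / len(fitted)
```

Any linear or convex aggregation routine can be plugged in as `aggregate`; the averaging step itself does not depend on that choice.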

We get the following corollary of Theorems 2.1 and 2.2.

Corollary 4.1 Let $m < n$, $\ell = n - m$, and let $H$ be either $\mathbb{R}^M$ or a convex compact subset of $\mathbb{R}^M$. Let
$\mathcal{S}$ be an arbitrary set of splits. Assume that $\hat p_{m,1}^S, \dots, \hat p_{m,M}^S \in L_2(\mathbb{R}^d)$ for fixed $X_1^{m,S}$, $\forall S \in \mathcal{S}$, and that
$p \in \mathcal{P}_0$. Then the averaged aggregate (4.1) satisfies

$$R_n(\tilde p_n^{\mathcal{S}}, p) \le \inf_{\lambda \in H} R_m \Big( \sum_{j=1}^M \lambda_j \hat p_{m,j},\ p \Big) + \Delta_{\ell,M} \tag{4.2}$$

for any integers $M \ge 2$ and $n \ge 1$.

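Proof. For any fixed $S \in \mathcal{S}$ and for a fixed training subsample $X_1^{m,S}$ inequalities (2.2) and (2.4) imply

$$E_p^{\ell,S} \|\tilde p_n^S - p\|^2 \le \min_{\lambda \in H} \Big\| \sum_{j=1}^M \lambda_j \hat p_{m,j}^S - p \Big\|^2 + \Delta_{\ell,M}, \quad \forall p \in \mathcal{P}_0, \tag{4.3}$$

where $E_p^{\ell,S}$ denotes the expectation w.r.t. the distribution of the validation sample $X_2^{\ell,S}$ when the true
density is $p$. Taking expectations of both sides of (4.3) w.r.t. the training sample $X_1^{m,S}$ we get

$$R_n(\tilde p_n^S, p) \le \inf_{\lambda \in H} R_m \Big( \sum_{j=1}^M \lambda_j \hat p_{m,j},\ p \Big) + \Delta_{\ell,M}. \tag{4.4}$$

The right hand side here does not depend on $S$. By Jensen's inequality,

$$R_n(\tilde p_n^{\mathcal{S}}, p) \le \frac{1}{\mathrm{card}(\mathcal{S})} \sum_{S \in \mathcal{S}} R_n(\tilde p_n^S, p).$$

This and (4.4) yield (4.2).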



5 Kernel aggregates for density estimation 

Here we apply the results of the previous sections to aggregation of kernel density estimators. Let $\hat p_{m,h}$
denote a kernel density estimator based on $X_1^m$ with $m \le n$,

$$\hat p_{m,h}(x) = \frac{1}{m h^d} \sum_{i=1}^m K\Big( \frac{X_i - x}{h} \Big), \tag{5.1}$$

where $h > 0$ is a bandwidth and $K \in L_2(\mathbb{R}^d)$ is a kernel. The notation $\hat p_{m,h}$ is slightly inconsistent with
$\hat p_{m,j}$ used above but this will not cause ambiguity in what follows. In order to cover such examples as
the sinc kernel we will not assume that $K$ is integrable.

Define $h_0 = (n \log n)^{-1/d}$, $a_n = a_0/\log n$, where $a_0 > 0$ is a constant, and $M$ such that

$$M - 2 = \max \{ j \in \mathbb{N} : h_0 (1 + a_n)^j < 1 \}.$$

It is easy to see that $M \le c_4 (\log n)^2$, where $c_4 > 0$ is a constant depending only on $a_0$ and $d$. Consider a
grid $\mathcal{H}$ on $[0, 1]$ with a weakly geometrically increasing step:

$$\mathcal{H} = \{ h_0, h_1, \dots, h_{M-1} \},$$

where $h_j = (1 + a_n)^j h_0$, $j = 1, \dots, M - 2$, and $h_{M-1} = 1$. Fix now an arbitrary family of splits $\mathcal{S}$ such
that, for $n \ge 3$,

$$m = \big\lfloor n \big( 1 - (\log n)^{-1} \big) \big\rfloor \quad \text{and} \quad \ell = n - m \ge \frac{n}{\log n},$$

where $\lfloor x \rfloor$ denotes the integer part of $x$.
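The grid and the split sizes are straightforward to compute. The following sketch does so under the definitions above; the function names and the default values `d=1`, `a0=1.0` are hypothetical choices for the example.

```python
import math

def bandwidth_grid(n, d=1, a0=1.0):
    """Return [h_0, ..., h_{M-1}] with h_j = (1 + a_n)^j h_0 and h_{M-1} = 1."""
    h0 = (n * math.log(n)) ** (-1.0 / d)
    an = a0 / math.log(n)
    grid = [h0]
    while grid[-1] * (1 + an) < 1.0:      # keep h_j as long as it stays below 1
        grid.append(grid[-1] * (1 + an))
    grid.append(1.0)                      # last bandwidth h_{M-1} = 1
    return grid

def split_sizes(n):
    """Return (m, l) with m = floor(n (1 - 1/log n)) and l = n - m."""
    m = math.floor(n * (1 - 1 / math.log(n)))
    return m, n - m
```

The length of the returned list is the number $M$ of candidate bandwidths, which grows only like $(\log n)^2$, so the remainder term $4LM/\ell$ stays of order $(\log n)^3/n$.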

Define $\tilde p_n^{\mathcal{S},K}$ as the linear or convex (with $H = \Lambda^M$) averaged aggregate $\tilde p_n^{\mathcal{S}}$ where the initial estimators
are taken in the form $\hat p_j = \hat p_{m,h_{j-1}}$, $j = 1, \dots, M$, with $\hat p_{m,h}$ given by (5.1). Since $\Delta_{\ell,M} \le 4LM/\ell$ we get
from (4.2) that, under the assumptions of Corollary 4.1,

$$R_n(\tilde p_n^{\mathcal{S},K}, p) \le \min_{h \in \mathcal{H}} R_m(\hat p_{m,h}, p) + \Delta_{\ell,M} \le \min_{h \in \mathcal{H}} R_m(\hat p_{m,h}, p) + 4 c_4 L \frac{(\log n)^3}{n}. \tag{5.2}$$

We now give a theorem that extends (5.2) to the $n$-sample oracle risk $\inf_{h>0} R_n(\hat p_{n,h}, p)$ instead of
$\min_{h \in \mathcal{H}} R_m(\hat p_{m,h}, p)$. Denote by $\mathcal{F}[f]$ the Fourier transform defined for $f \in L_2(\mathbb{R}^d)$ and normalized in
such a way that its restriction to $f \in L_2(\mathbb{R}^d) \cap L_1(\mathbb{R}^d)$ has the form $\mathcal{F}[f](t) = \int_{\mathbb{R}^d} e^{i t^T x} f(x)\,dx$, $t \in \mathbb{R}^d$.
In the sequel $\varphi = \mathcal{F}[p]$ denotes the characteristic function associated to $p$.

Theorem 5.1 Assume that $p$ satisfies $\|p\|_\infty \le L$ with $0 < L < \infty$ and let $K \in L_2(\mathbb{R}^d)$ be a kernel such
that a version of its Fourier transform $\mathcal{F}[K]$ takes values in $[0, 1]$ and satisfies the monotonicity condition
$\mathcal{F}[K](h't) \ge \mathcal{F}[K](ht)$, $\forall t \in \mathbb{R}^d$, $h > h' > 0$. Then there exists an integer $n_0 = n_0(L, \|K\|) \ge 4$ such
that for $n \ge n_0$ the averaged aggregate $\tilde p_n^{\mathcal{S},K}$ satisfies the oracle inequality

$$R_n(\tilde p_n^{\mathcal{S},K}, p) \le \big( 1 + c_5 (\log n)^{-1} \big) \inf_{h>0} R_n(\hat p_{n,h}, p) + c_6 \frac{(\log n)^3}{n}, \tag{5.3}$$

where $c_5$ is a positive constant depending only on $d$ and $a_0$, and $c_6 > 0$ depends only on $L$, $\|K\|$, $d$ and $a_0$.

Proof. Assume throughout that $n \ge 4$. First note that (5.3) follows from (5.2) and from the following
two inequalities that we are going to prove below:

$$\inf_{h \in [h_0, h_{M-1}]} R_n(\hat p_{n,h}, p) \le \inf_{h>0} R_n(\hat p_{n,h}, p) + \|K\|^2 \frac{\log n}{n}, \tag{5.4}$$

$$\min_{h \in \mathcal{H}} R_m(\hat p_{m,h}, p) \le \big( 1 + c_5 (\log n)^{-1} \big) \inf_{h \in [h_0, h_{M-1}]} R_n(\hat p_{n,h}, p) + \frac{c_5 L}{n \log n}. \tag{5.5}$$

In turn, (5.4) follows if we show that

$$\inf_{h \in [h_0, h_{M-1}]} R_n(\hat p_{n,h}, p) \le \inf_{0 < h < h_0} R_n(\hat p_{n,h}, p), \tag{5.6}$$

$$\inf_{h \in [h_0, h_{M-1}]} R_n(\hat p_{n,h}, p) \le \inf_{h > h_{M-1}} R_n(\hat p_{n,h}, p) + \|K\|^2 \frac{\log n}{n}. \tag{5.7}$$

Thus, it remains to prove (5.5)-(5.7). We will use the following Fourier representation for the MISE of
kernel estimators that can be easily obtained from Plancherel's formula (it is a multivariate extension of
the representation for $d = 1$ given, e.g., in Golubev (1992) and in Wand and Jones (1995), p. 55):

$$R_n(\hat p_{n,h}, p) = \frac{1}{(2\pi)^d} \int_{\mathbb{R}^d} \Big( |1 - \mathcal{F}[K](ht)|^2 |\varphi(t)|^2 + \frac{1}{n} \big( 1 - |\varphi(t)|^2 \big) |\mathcal{F}[K](ht)|^2 \Big) dt. \tag{5.8}$$
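As a sanity check of (5.8), one can evaluate it numerically in a case where every integral is Gaussian. The sketch below assumes $d = 1$, the standard Gaussian kernel (so that $\mathcal{F}[K](ht) = e^{-h^2 t^2/2}$) and the standard normal density (so that $|\varphi(t)|^2 = e^{-t^2}$). The closed form in `mise_closed_form` is obtained here by computing those Gaussian integrals by hand; both function names and the truncation parameters are assumptions of the example.

```python
import math

def mise_fourier(n, h, t_max=40.0, steps=80001):
    """Trapezoid-rule evaluation of (5.8) for d = 1, Gaussian kernel, N(0,1) density."""
    dt = 2 * t_max / (steps - 1)
    total = 0.0
    for k in range(steps):
        t = -t_max + k * dt
        fk = math.exp(-0.5 * (h * t) ** 2)     # F[K](ht) for the Gaussian kernel
        phi2 = math.exp(-t * t)                # |phi(t)|^2 for p = N(0, 1)
        w = 0.5 if k in (0, steps - 1) else 1.0
        total += w * ((1 - fk) ** 2 * phi2 + (1 - phi2) * fk ** 2 / n)
    return total * dt / (2 * math.pi)

def mise_closed_form(n, h):
    """Same quantity with each Gaussian integral evaluated analytically."""
    return (1 / (n * h) + (1 - 1 / n) / math.sqrt(1 + h * h)
            - 2 * math.sqrt(2) / math.sqrt(2 + h * h) + 1) / (2 * math.sqrt(math.pi))
```

The truncation at `t_max` is harmless only when $h$ is not too small, because the variance term decays like $e^{-h^2 t^2}$; for very small bandwidths `t_max` would have to be increased.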

Furthermore, using Plancherel's formula we get

$$\int_{\mathbb{R}^d} |\varphi(t)|^2\,dt = (2\pi)^d \int_{\mathbb{R}^d} p^2(x)\,dx \le (2\pi)^d L, \qquad \frac{1}{(2\pi)^d} \int_{\mathbb{R}^d} |\mathcal{F}[K](ht)|^2\,dt = h^{-d} \|K\|^2, \quad \forall h > 0. \tag{5.9}$$



Proof of (5.6). Using (5.8), (5.9) and the fact that $0 \le \mathcal{F}[K](t) \le 1$, $\forall t \in \mathbb{R}^d$, for any $h < h_0 = (n \log n)^{-1/d}$
we obtain

$$R_n(\hat p_{n,h}, p) \ge \frac{1}{n (2\pi)^d} \int_{\mathbb{R}^d} \big( 1 - |\varphi(t)|^2 \big) |\mathcal{F}[K](ht)|^2\,dt \ge \frac{\|K\|^2}{n h^d} - \frac{L}{n} \ge \|K\|^2 \log n - \frac{L}{n}. \tag{5.10}$$

On the other hand, since $h_{M-1} = 1$ we get

$$R_n(\hat p_{n,h_{M-1}}, p) \le \frac{1}{(2\pi)^d} \int_{\mathbb{R}^d} \Big( |1 - \mathcal{F}[K](t)|^2 |\varphi(t)|^2 + \frac{1}{n} |\mathcal{F}[K](t)|^2 \Big) dt \le L + \frac{\|K\|^2}{n}. \tag{5.11}$$

The right hand side of (5.10) is larger than that of (5.11) for $n \ge n_0$, where $n_0$ depends only on $L$ and
$\|K\|$. Thus, (5.6) is valid for $n \ge n_0$.

Proof of (5.7). Clearly, (5.7) follows if we show that

$$R_n(\hat p_{n,h'}, p) \le \inf_{h > h_{M-1}} R_n(\hat p_{n,h}, p) + \|K\|^2 \frac{\log n}{n}$$

for $h' = (\log n)^{-1/d} \in [h_0, h_{M-1}]$. To prove this inequality, first note that, by the monotonicity of
$h \mapsto \mathcal{F}[K](ht)$, we have

$$\int_{\mathbb{R}^d} |1 - \mathcal{F}[K](ht)|^2 |\varphi(t)|^2\,dt \ge \int_{\mathbb{R}^d} |1 - \mathcal{F}[K](h't)|^2 |\varphi(t)|^2\,dt, \quad \forall h \ge h_{M-1}.$$

This, together with (5.8) and the second equality in (5.9), yields that, for any $h \ge h_{M-1}$,

$$R_n(\hat p_{n,h}, p) \ge R_n(\hat p_{n,h'}, p) - \frac{1}{n (2\pi)^d} \int_{\mathbb{R}^d} \big( 1 - |\varphi(t)|^2 \big) |\mathcal{F}[K](h't)|^2\,dt \ge R_n(\hat p_{n,h'}, p) - \|K\|^2 \frac{\log n}{n}.$$

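Proof of (5.5). We will show that for any $h \in [h_0, h_{M-1}]$ one has

$$R_m(\hat p_{m,\bar h}, p) \le \big( 1 + c_5 (\log n)^{-1} \big) R_n(\hat p_{n,h}, p) + \frac{c_5 L}{n \log n} \tag{5.12}$$

where $\bar h = \max\{h_j : h_j \le h\}$. Clearly, this implies (5.5). To prove (5.12), note that if $h_j \le h < h_{j+1}$
we have $\bar h = h_j$, $h/h_j \le 1 + a_n = 1 + a_0/\log n$. Therefore, (5.8) and the monotonicity of $h \mapsto \mathcal{F}[K](ht)$
imply

$$R_m(\hat p_{m,h_j}, p) = \frac{1}{(2\pi)^d} \int_{\mathbb{R}^d} \Big( [1 - \mathcal{F}[K](h_j t)]^2 |\varphi(t)|^2 + \frac{1}{m} [\mathcal{F}[K](h_j t)]^2 \Big) dt - \frac{1}{(2\pi)^d m} \int_{\mathbb{R}^d} |\varphi(t)|^2 [\mathcal{F}[K](h_j t)]^2\,dt$$

$$\le \frac{1}{(2\pi)^d} \int_{\mathbb{R}^d} \Big( |1 - \mathcal{F}[K](ht)|^2 |\varphi(t)|^2 + \frac{1}{m} |\mathcal{F}[K](h_j t)|^2 \Big) dt - \frac{1}{(2\pi)^d n} \int_{\mathbb{R}^d} |\varphi(t)|^2 [\mathcal{F}[K](ht)]^2\,dt$$

$$\le \frac{n h^d}{m h_j^d} R_n(\hat p_{n,h}, p) + \Big( \frac{n h^d}{m h_j^d} - 1 \Big) \frac{1}{(2\pi)^d n} \int_{\mathbb{R}^d} |\varphi(t)|^2 [\mathcal{F}[K](ht)]^2\,dt,$$

where the last step uses the second equality in (5.9) to write $\frac{1}{m} \int |\mathcal{F}[K](h_j t)|^2 dt = \frac{n h^d}{m h_j^d} \cdot \frac{1}{n} \int |\mathcal{F}[K](ht)|^2 dt$.
Using here the fact that $(n/m)(h/h_j)^d \le \big( 1 - (\log n)^{-1} - n^{-1} \big)^{-1} (1 + a_0/\log n)^d \le 1 + c_5 (\log n)^{-1}$ for $n \ge 4$
and for a constant $c_5 > 0$ depending only on $d$, $a_0$, and applying (5.9) we get (5.12).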



Corollary 5.1 Let the assumptions of Theorem 5.1 be satisfied, and let $\inf_{h>0} R_n(\hat p_{n,h}, p) \ge c\, n^{-1+\alpha}$ for
some $c > 0$, $\alpha > 0$. Then

$$R_n(\tilde p_n^{\mathcal{S},K}, p) \le \inf_{h>0} R_n(\hat p_{n,h}, p) \big( 1 + o(1) \big), \quad n \to \infty. \tag{5.13}$$

Using the argument as in Stone (1984) it is not hard to check that the assumption of Corollary 5.1 is
valid for any non-negative kernel. In the one-dimensional case it also holds for any kernel satisfying the
conditions of Lemma 4.1 in Rigollet (2006). In contrast to Rigollet (2006), Corollary 5.1 applies to
multidimensional density estimation.

Theorem 5.1 and Corollary 5.1 show that the linear or convex aggregate $\tilde p_n^{\mathcal{S},K}$ mimics the best kernel
estimator, without being itself in the class of kernel estimators with data-driven bandwidth. Another
method with such a property has been suggested recently by Rigollet (2006) in the one-dimensional case;
it is based on a block Stein procedure in the Fourier domain.

The results of this section can be compared to the work on optimality of bandwidth selection in the
$L_2$ sense for kernel density estimation. A key reference is the theorem of Stone (1984) establishing that,
under some assumptions,

$$\lim_{n \to \infty} \frac{\|\hat p_{n,\hat h_n} - p\|^2}{\inf_{h>0} \|\hat p_{n,h} - p\|^2} = 1 \quad \text{with probability 1},$$

where $\hat h_n$ is a data-dependent bandwidth chosen by cross-validation. Our results are of a different type,
because they treat convergence of expected risk rather than almost sure convergence. In addition, we 
provide oracle inequalities with precisely defined remainder terms that hold under mild assumptions on 
the density and on the kernel. Unlike Stone (1984), we do not require the one-dimensional marginals 
of the density $p$ to be uniformly bounded. Wegkamp (1999) considers a model selection approach to
bandwidth choice for kernel density estimation. His main result is of the form (5.13) with a model
selection kernel estimator in place of $\tilde p_n^{\mathcal{S},K}$, but it is valid for bounded, nonnegative, Lipschitz kernels with
compact support (similar assumptions on $K$ are imposed by Stone (1984)). Our result covers kernels with
unbounded support, for example, the Gaussian and Silverman's kernels that are often implemented, and
Pinsker's kernel that gives sharp minimax adaptive estimators on Sobolev classes (cf. Section 6 below).
In a recent work of Dalelane (2004) the choice of bandwidth and of the kernel by cross-validation is
investigated for the one-dimensional case ($d = 1$). She provides an oracle inequality similar to (5.3) with
a remainder term of the order $n^{\delta - 1}$, $0 < \delta < 1$, instead of $(\log n)^3/n$ that we have here.

All these papers consider the model selection approach, i.e., they study estimators with a single data- 
driven bandwidth chosen from a set of candidate bandwidths. Our approach is different since we estimate 
the density by a linear or convex combination of kernel estimators with bandwidths in the candidate set. 
Simulations (see Section 7 below) show that in most cases one of these estimators gets highly dominant
weight in the resulting mixture. However, inclusion of other estimators with some smaller weights allows 
one to treat more efficiently densities with inhomogeneous smoothness. 



6 Sharp minimax adaptivity of kernel aggregates 

In this section we show that the kernel aggregate defined in Section is sharp minimax adaptive over a 
scale of Sobolev classes of densities. 

For any $\beta > 0$, $Q > 0$ and any integer $d \ge 1$ define the Sobolev classes of densities on $\mathbb{R}^d$ by 

$$\Theta(\beta,Q) = \Big\{ p:\mathbb{R}^d\to\mathbb{R} \;:\; p \ge 0,\ \int_{\mathbb{R}^d} p(x)\,dx = 1,\ \int_{\mathbb{R}^d} \|t\|_d^{2\beta}\,|\varphi(t)|^2\,dt \le Q \Big\},$$

where $\|\cdot\|_d$ denotes the Euclidean norm in $\mathbb{R}^d$ and $\varphi = \mathcal{F}[p]$. Consider the Pinsker kernel $K_\beta$, i.e. the 
kernel having the Fourier transform 

$$\mathcal{F}[K_\beta](t) = \big(1 - \|t\|_d^{\beta}\big)_+, \qquad t \in \mathbb{R}^d,$$

where $x_+ = \max(x,0)$. Set 

$$C^* = \frac{[Q(2\beta+d)]^{\frac{d}{2\beta+d}}}{d(2\pi)^d}\left(\frac{\beta S_d}{\beta+d}\right)^{\frac{2\beta}{2\beta+d}}, \tag{6.1}$$

where $S_d = 2\pi^{d/2}/\Gamma(d/2)$ is the surface area of the sphere of radius 1 in $\mathbb{R}^d$. For $d = 1$ the value $C^*$ equals 
the Pinsker constant [Pinsker (1980), see also Tsybakov (2004), Chapter 3]. 

Corollary 6.1 For any integer $d \ge 1$ and any $\beta > d/2$, $Q > 0$, the averaged linear or convex kernel 
aggregate $\tilde p_n^{K_\beta}$ defined in Section 5 satisfies 

$$\sup_{p\in\Theta(\beta,Q)} R_n(\tilde p_n^{K_\beta},p) \le C^*\, n^{-\frac{2\beta}{2\beta+d}}(1+o(1)), \qquad n\to\infty,$$

where $C^*$ is defined in (6.1). 
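As a small numerical aid, the quantities entering the constant can be evaluated directly. The sketch below assumes the closed form $C^* = [Q(2\beta+d)]^{d/(2\beta+d)}\,(\beta S_d/(\beta+d))^{2\beta/(2\beta+d)} / (d(2\pi)^d)$ with $S_d = 2\pi^{d/2}/\Gamma(d/2)$ and the Fourier transform $(1-\|t\|_d^\beta)_+$ of the Pinsker kernel; the function names are ours and purely illustrative.

```python
import math


def unit_sphere_surface(d: int) -> float:
    """Surface area S_d = 2 * pi^(d/2) / Gamma(d/2) of the unit sphere in R^d."""
    return 2.0 * math.pi ** (d / 2.0) / math.gamma(d / 2.0)


def pinsker_kernel_ft(t, beta: float) -> float:
    """Fourier transform (1 - ||t||^beta)_+ of the Pinsker kernel at a point t of R^d."""
    norm = math.sqrt(sum(x * x for x in t))
    return max(1.0 - norm ** beta, 0.0)


def c_star(beta: float, Q: float, d: int = 1) -> float:
    """Constant C* under the closed form assumed in the lead-in."""
    s_d = unit_sphere_surface(d)
    e = 2.0 * beta + d
    numerator = (Q * e) ** (d / e) * (beta * s_d / (beta + d)) ** (2.0 * beta / e)
    return numerator / (d * (2.0 * math.pi) ** d)


if __name__ == "__main__":
    for d in (1, 2, 3):
        print(d, unit_sphere_surface(d), c_star(beta=2.0, Q=1.0, d=d))
```

For $d = 1$ the sphere consists of two points, so $S_1 = 2$, which gives a quick sanity check of the implementation.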

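Proof. Denote by $\hat p_{n,h}$ the kernel density estimator defined in (5.1) with $m = n$ and $K = K_\beta$. Using 
(5.8) and the fact that $0 \le \mathcal{F}[K_\beta](t) \le 1$, $\forall t \in \mathbb{R}^d$, we get 

$$R_n(\hat p_{n,h},p) \le \frac{1}{(2\pi)^d}\int_{\mathbb{R}^d}\Big( |1-\mathcal{F}[K_\beta](ht)|^2\,|\varphi(t)|^2 + \frac{1}{n}\,|\mathcal{F}[K_\beta](ht)|^2 \Big)\,dt$$

$$\le \frac{1}{(2\pi)^d}\Big( Q h^{2\beta} + \frac{1}{n}\int_{\mathbb{R}^d} |\mathcal{F}[K_\beta](ht)|^2\,dt \Big), \qquad \forall h > 0,\ p \in \Theta(\beta,Q). \tag{6.2}$$

Now, choose $h$ satisfying 

$$\int_{\mathbb{R}^d} \|t\|_d^{\beta}\,\mathcal{F}[K_\beta](ht)\,dt = Q n h^{\beta}. \tag{6.3}$$

The solution of (6.3) is 

$$h = D^* n^{-\frac{1}{2\beta+d}} \quad\text{where}\quad D^* = \left(\frac{\beta S_d}{Q(\beta+d)(2\beta+d)}\right)^{\frac{1}{2\beta+d}}.$$

With $h$ satisfying (6.3), inequality (6.2) becomes 

$$R_n(\hat p_{n,h},p) \le \frac{1}{(2\pi)^d n}\int_{\mathbb{R}^d} \mathcal{F}[K_\beta](ht)\Big(\mathcal{F}[K_\beta](ht) + \|ht\|_d^{\beta}\Big)\,dt = \frac{1}{(2\pi)^d n h^d}\int_{\mathbb{R}^d}\mathcal{F}[K_\beta](t)\,dt$$

$$= \frac{S_d}{(2\pi)^d n h^d}\int_0^1 (1-r^{\beta})\,r^{d-1}\,dr = C^*\, n^{-\frac{2\beta}{2\beta+d}}.$$

Thus, 

$$\inf_{h>0} R_n(\hat p_{n,h},p) \le C^*\, n^{-\frac{2\beta}{2\beta+d}}, \qquad \forall p \in \Theta(\beta,Q). \tag{6.4}$$

Note that the kernel $K = K_\beta$ satisfies the conditions of Theorem 5.1 and it is easy to see that for $\beta > d/2$ 
there exists a constant $0 < L < \infty$ such that $\|p\|_\infty \le L$ for all $p \in \Theta(\beta,Q)$. Thus, (5.3) holds, and to 
prove the corollary it suffices to take suprema of both sides of (5.3) over $p \in \Theta(\beta,Q)$ and to use (6.4). □ 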
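The bandwidth calculation in the proof can be checked numerically. The sketch below is only an illustration under our reading of the argument: it takes the risk bound to be $(2\pi)^{-d}\big(Qh^{2\beta} + S_d Q_d(\beta)/(nh^d)\big)$ with $Q_d(\beta) = 1/d - 2/(\beta+d) + 1/(2\beta+d)$, the bandwidth to be $h = D^* n^{-1/(2\beta+d)}$ with $D^* = (\beta S_d/(Q(\beta+d)(2\beta+d)))^{1/(2\beta+d)}$, and compares the resulting value with $C^* n^{-2\beta/(2\beta+d)}$. All function names are hypothetical.

```python
import math


def unit_sphere_surface(d):
    """S_d = 2 * pi^(d/2) / Gamma(d/2)."""
    return 2.0 * math.pi ** (d / 2.0) / math.gamma(d / 2.0)


def q_d(beta, d):
    """int_0^1 (1 - r^beta)^2 r^(d-1) dr in closed form."""
    return 1.0 / d - 2.0 / (beta + d) + 1.0 / (2.0 * beta + d)


def risk_bound(h, n, beta, Q, d):
    """Bias bound Q h^(2 beta) plus variance term, both divided by (2 pi)^d."""
    variance = unit_sphere_surface(d) * q_d(beta, d) / (n * h ** d)
    return (Q * h ** (2.0 * beta) + variance) / (2.0 * math.pi) ** d


def optimal_bandwidth(n, beta, Q, d):
    """h = D* n^(-1/(2 beta + d)) with D* as assumed in the lead-in."""
    e = 2.0 * beta + d
    d_star = (beta * unit_sphere_surface(d) / (Q * (beta + d) * e)) ** (1.0 / e)
    return d_star * n ** (-1.0 / e)


def c_star(beta, Q, d):
    """Constant C* under the closed form assumed for (6.1)."""
    e = 2.0 * beta + d
    s_d = unit_sphere_surface(d)
    return ((Q * e) ** (d / e) * (beta * s_d / (beta + d)) ** (2.0 * beta / e)
            / (d * (2.0 * math.pi) ** d))


if __name__ == "__main__":
    n, beta, Q, d = 1000, 2.0, 1.0, 1
    h_opt = optimal_bandwidth(n, beta, Q, d)
    print("bound at h_opt :", risk_bound(h_opt, n, beta, Q, d))
    print("C* n^(-2b/(2b+d)):", c_star(beta, Q, d) * n ** (-2.0 * beta / (2.0 * beta + d)))
```

The two printed numbers should agree up to rounding, and perturbing `h_opt` in either direction should increase the bound, since the chosen bandwidth also minimizes the right-hand side in $h$.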



Along with Corollary 6.1, for any $\beta > d/2$, $Q > 0$ the following lower bound holds: 

$$\inf_{T_n}\sup_{p\in\Theta(\beta,Q)} R_n(T_n,p) \ge C^*\, n^{-\frac{2\beta}{2\beta+d}}(1+o(1)), \qquad n\to\infty, \tag{6.5}$$

where $C^*$ is defined in (6.1) and $\inf_{T_n}$ denotes the infimum over all estimators of $p$. For $d = 1$ the bound 
(6.5) can be deduced from the results of Golubev (1991, 1992); it is also proven explicitly in Schipper 
(1996) (for integer $\beta$) and in Rigollet (2006), Dalelane (2004) (for all $\beta > 1/2$). For $d > 1$ the bound 
(6.5) can be found for a slightly different but essentially analogous minimax setup in Efromovich (2000). 

Corollary 6.1 and the lower bound (6.5) imply that the estimator $\tilde p_n^{K_\beta}$ is asymptotically minimax in 
the exact sense (with the constant) over the Sobolev class of densities $\Theta(\beta,Q)$ and is adaptive to $Q$ for 
any given $\beta$. However, $\tilde p_n^{K_\beta}$ is not adaptive to the unknown smoothness $\beta$ since the Pinsker kernel $K_\beta$ 
depends on $\beta$. 

To get adaptation to $\beta$, we need to push aggregation one step forward: we will aggregate kernel 
density estimators not only for different bandwidths but also for different kernels. To this end, we refine 
the notation $\hat p_{n,h}$ of (5.1) to $\hat p_{n,h,K}$, indicating the dependence of the density estimator both on the kernel 
$K$ and on the bandwidth $h$. For a family of $N \ge 2$ kernels, $\mathcal{K} = \{K^{(1)},\dots,K^{(N)}\}$, define $\tilde p_n^{\mathcal{K}}$ as the linear 
or convex averaged aggregate where the initial estimators are taken in the collection of kernel density 
estimators $\{\hat p_{n,h,K},\ K \in \mathcal{K},\ h \in \mathcal{H}\}$. Thus, we now aggregate $NM$ estimators instead of $M$. The following 
corollary is obtained by the same argument as Theorem 5.1 by merely inserting the minimum over $K \in \mathcal{K}$ 
in the oracle inequality and by replacing $\|K\|$ with its upper or lower bounds in the remainder terms. 

Corollary 6.2 Assume that $p$ satisfies $\|p\|_\infty \le L$ with $0 < L < \infty$ and let $\mathcal{K} = \{K^{(1)},\dots,K^{(N)}\}$ 
be a family of kernels satisfying the assumptions of Theorem 5.1 and such that there exist constants 
$0 < \underline{c} < \bar{c} < \infty$ with $\underline{c} \le \|K^{(j)}\| \le \bar{c}$, $j = 1,\dots,N$. Then there exists an integer $n_1 = n_1(L,\underline{c},\bar{c}) \ge 4$ such 
that for $n \ge n_1$ the averaged aggregate $\tilde p_n^{\mathcal{K}}$ satisfies the oracle inequality 

Rn(Pn' ,P) < (1 + csQogn)- 1 ) min inf Rn(p n , h , K ,p) + c 7 y B 1 , (6.6) 
v ' kgk. h>o n 

where $c_5 > 0$ is the same constant as in Theorem 5.1 and $c_7 > 0$ depends only on $L$, $\underline{c}$, $\bar{c}$, $d$ and $a_0$. 

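Consider now a particular family of kernels $\mathcal{K}$. Define $B = \{\beta_1,\dots,\beta_N\}$ where $\beta_1 = d/2$, $\beta_j = 
\beta_{j-1} + N^{-1/2}$, $j = 2,\dots,N$, and let $\mathcal{K}_B = \{K_b,\ b \in B\}$ be a family of Pinsker kernels indexed by $b \in B$. 
We will later assume that $N = N_n \to \infty$ as $n \to \infty$, but for the moment assume that $N \ge 2$ is fixed. 
Note that $\mathcal{K} = \mathcal{K}_B$ satisfies the assumptions of Corollary 6.2. In fact, 

$$\|K_\beta\|^2 = \frac{S_d\,Q_d(\beta)}{(2\pi)^d} \quad\text{where}\quad Q_d(\beta) = \frac{1}{d} - \frac{2}{\beta+d} + \frac{1}{2\beta+d},$$

and 

$$\frac{1}{6d} \le Q_d(\beta) \le \frac{1}{d}, \qquad \forall \beta \ge d/2. \tag{6.7}$$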
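The grid of smoothness indices and the two-sided bound on $Q_d$ are easy to verify numerically. The sketch below assumes the closed form $Q_d(\beta) = 1/d - 2/(\beta+d) + 1/(2\beta+d)$ and the bounds $1/(6d) \le Q_d(\beta) \le 1/d$ for $\beta \ge d/2$; `smoothness_grid` and `round_up_to_grid` are hypothetical helper names, the latter returning the smallest grid point that is not below a given $\beta$.

```python
def q_d(beta, d):
    """Q_d(beta) = 1/d - 2/(beta + d) + 1/(2 beta + d)."""
    return 1.0 / d - 2.0 / (beta + d) + 1.0 / (2.0 * beta + d)


def smoothness_grid(N, d):
    """Grid beta_1 = d/2, beta_j = beta_{j-1} + N^(-1/2), j = 2, ..., N."""
    step = N ** -0.5
    return [d / 2.0 + j * step for j in range(N)]


def round_up_to_grid(beta, grid):
    """Smallest grid point that is >= beta; raises if beta exceeds the grid."""
    candidates = [b for b in grid if b >= beta]
    if not candidates:
        raise ValueError("beta lies above the largest grid point")
    return min(candidates)


if __name__ == "__main__":
    d, N = 2, 16
    grid = smoothness_grid(N, d)
    beta = 1.1
    beta_bar = round_up_to_grid(beta, grid)
    print("grid step:", N ** -0.5, "beta_bar:", beta_bar)
    print("Q_d ratio:", q_d(beta_bar, d) / q_d(beta, d), "<=", 1.0 + 6.0 * N ** -0.5)
```

Rounding $\beta$ up to the grid changes $Q_d$ by a factor of at most $1 + 6N^{-1/2}$, which is the price paid in the oracle inequality for discretizing the smoothness.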

Thus, the oracle inequality (6.6) holds with $\mathcal{K} = \mathcal{K}_B$. We will now prove that, under the assumptions of 
Corollary 6.2, the linear or convex aggregate $\tilde p_n^{\mathcal{K}_B}$ with the initial estimators in $\{\hat p_{n,h,K},\ K \in \mathcal{K}_B,\ h \in \mathcal{H}\}$ 
satisfies the following inequality where $\beta$ in the oracle risk varies continuously: 

R n {f n ^,p)<(l + ^){l + ^=) inf R n (p n ^,p) + c s ^^. (6.8) 

d/2<0</3 N 

Fix $\beta \in (d/2,\beta_N)$, $Q > 0$ and $p \in \Theta(\beta,Q)$. Define $\bar\beta = \min\{\beta_j \in B : \beta_j > \beta\}$. In view of (6.6) with 
$\mathcal{K} = \mathcal{K}_B$, to prove (6.8) it is sufficient to show that for any $h > 0$ one has 

$$R_n(\hat p_{n,h,K_{\bar\beta}},p) \le (1 + 6N^{-1/2})\Big[ R_n(\hat p_{n,h,K_\beta},p) + \frac{L}{n} \Big]. \tag{6.9}$$

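Using (5.8) and the inequality $\bar\beta > \beta$ we get 

$$R_n(\hat p_{n,h,K_{\bar\beta}},p) \le R_n(\hat p_{n,h,K_\beta},p) + l(\bar\beta) - l(\beta), \tag{6.10}$$

where 

$$l(\beta) = \frac{1}{(2\pi)^d n}\int_{\mathbb{R}^d} |\mathcal{F}[K_\beta](ht)|^2\,dt = \frac{Q_d(\beta)\,S_d}{(2\pi)^d n h^d}.$$

Now, $Q_d(\bar\beta) = Q_d(\beta) + (\bar\beta - \beta)\,Q_d'(b)$ for some $b \in [\beta,\bar\beta]$. Using (6.7) and the inequality $|Q_d'(\beta)| \le 1/d^2$ 
valid for all $\beta > d/2$, we find that 

$$Q_d(\bar\beta) \le Q_d(\beta) + 6(\bar\beta - \beta)\,Q_d(\beta) \le (1 + 6N^{-1/2})\,Q_d(\beta).$$

Therefore, 

$$l(\bar\beta) \le (1 + 6N^{-1/2})\,l(\beta). \tag{6.11}$$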
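For the reader's convenience, the comparison of $Q_d$ at the two smoothness indices can be spelled out. The chain below is our own expansion of the argument, not a quotation of the original proof; it assumes the closed form $Q_d(\beta) = 1/d - 2/(\beta+d) + 1/(2\beta+d)$, the lower bound $Q_d(\beta) \ge 1/(6d)$ for $\beta \ge d/2$, and the grid spacing $\bar\beta - \beta \le N^{-1/2}$.

```latex
\begin{align*}
0 \le Q_d'(b) &= \frac{2}{(b+d)^2} - \frac{2}{(2b+d)^2}
   \le \frac{2}{(b+d)^2} \le \frac{8}{9d^2} < \frac{1}{d^2},
   \qquad b \ge d/2, \\
Q_d(\bar\beta) - Q_d(\beta) &= (\bar\beta-\beta)\,Q_d'(b)
   \le \frac{\bar\beta-\beta}{d^2}
   \le \frac{\bar\beta-\beta}{d}
   \le 6(\bar\beta-\beta)\,Q_d(\beta)
   \le 6N^{-1/2}\,Q_d(\beta).
\end{align*}
```

Since the variance term is proportional to $Q_d$ for fixed $n$, $h$ and $d$, the same factor $1 + 6N^{-1/2}$ carries over to it.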
Also, in view of (5.8) and (5.9) we have 

$$l(\beta) \le R_n(\hat p_{n,h,K_\beta},p) + \frac{L}{n}. \tag{6.12}$$

Combining (6.10), (6.11) and (6.12) we obtain (6.9), thus proving (6.8). 

Corollary 6.3 Assume that $\mathrm{Card}(\mathcal{K}_B) = N_n$ where $\lim_{n\to\infty} N_n = \infty$ and $\limsup_{n\to\infty} N_n/(\log n)^{\nu} < \infty$ 
for some $\nu > 0$. Then for any integer $d \ge 1$ and any $\beta > d/2$, $Q > 0$, the averaged linear or convex 
kernel aggregate $\tilde p_n^{\mathcal{K}_B}$ satisfies 

$$\sup_{p\in\Theta(\beta,Q)} R_n(\tilde p_n^{\mathcal{K}_B},p) \le C^*\, n^{-\frac{2\beta}{2\beta+d}}(1+o(1)), \qquad n\to\infty,$$

where $C^*$ is defined in (6.1). 

Proof. Fix $\beta > d/2$, $Q > 0$. Let $n$ be large enough to guarantee that $\beta < \beta_{N_n}$. Then the infimum on 
the right in (6.8) is smaller than or equal to $C^* n^{-\frac{2\beta}{2\beta+d}}$ for all $p \in \Theta(\beta,Q)$ [cf. (6.4)]. To conclude the proof, it 
suffices to take suprema of both sides of (6.8) over $p \in \Theta(\beta,Q)$ and then pass to the limit as $n \to \infty$. □ 

Corollary 6.3 and the lower bound (6.5) imply that the aggregate $\tilde p_n^{\mathcal{K}_B}$ is asymptotically minimax in 
the exact sense (with the constant) over all Sobolev classes of densities with $\beta > d/2$, $Q > 0$, and thus it 
is sharp adaptive (recall that its construction does not depend on the parameters $Q$ and $\beta$ of the class). 



7 Simulations 

Here we discuss the results of simulations for the averaged convex kernel aggregate with $H = \Lambda^M$ in the 
one-dimensional case. We focus on convex aggregation because simulations of linear aggregates show less 
numerical stability. The set of splits S is reduced to 10 random splits of the sample since we observed that 
the estimator is already stable for this number (cf. Figure 3). In the default simulations each sample is 
divided into two subsamples of equal sizes. The samples are drawn from 6 densities that can be classified 
into the following three groups. 

• Common reference densities: the standard Gaussian density and the standard exponential density. 

• Gaussian mixtures from Marron and Wand (1992) that are known to be difficult to estimate. We 
consider the Claw density and the Smooth Comb density. 

• Densities with highly inhomogeneous smoothness. We consider two densities referred to as dens1 
and dens2 that are both mixtures of the standard Gaussian density $\varphi(\cdot)$ and of an oscillating density. 
They are defined as 

$$0.5\,\varphi(\cdot) + 0.5\sum_{i=1}^{T} \mathbf{1}_{\left(\frac{2(i-1)}{T},\,\frac{2i-1}{T}\right]}(\cdot),$$

where $T = 14$ for dens1 and $T = 10$ for dens2. 
We used the procedure defined in Section 5 to aggregate 6 kernel density estimators constructed with the 
Gaussian $\mathcal{N}(0,1)$ kernel $K$ and with bandwidths $h$ from the set $\mathcal{H} = \{0.001, 0.005, 0.01, 0.05, 0.1, 0.5\}$. 
This procedure is further called pure kernel aggregation and denoted by AggPure. Another estimator that 
we analyze is the AggStein procedure: it aggregates 7 estimators, namely the same 6 kernel estimators as for 
AggPure to which we add the block Stein density estimator described in Rigollet (2006). The optimization 
problem of Section 2 that provides the aggregates is solved numerically by a quadratic programming solver under 
linear constraints: here we used the package quadprog of R. Our simulation study shows that AggPure 
and AggStein perform well for moderate sample sizes and are reasonable competitors to 
kernel density estimators with common bandwidth selectors. 
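To make the pure kernel aggregation step concrete, here is a minimal sketch of one way such a procedure could be coded. It is not the implementation used for the simulations. It assumes that, for each split, the weights minimize the sample-splitting least squares criterion $\lambda^\top G\lambda - 2\lambda^\top c$ over the simplex, where $G_{jk} = \int \hat p_j \hat p_k$ is computed on the training part and $c_j$ is the average of $\hat p_j$ over the held-out part. The quadratic program is solved here by projected gradient descent in place of the quadprog solver mentioned above, and all function names are hypothetical.

```python
import math
import random


def gauss_pdf(x, sigma):
    """Density of N(0, sigma^2) at x."""
    return math.exp(-0.5 * (x / sigma) ** 2) / (sigma * math.sqrt(2.0 * math.pi))


def kde(x, data, h):
    """Gaussian-kernel density estimate at x with bandwidth h."""
    return sum(gauss_pdf(x - xi, h) for xi in data) / len(data)


def project_to_simplex(v):
    """Euclidean projection of v onto {w : w_j >= 0, sum_j w_j = 1}."""
    theta, cumulative = 0.0, 0.0
    for i, u in enumerate(sorted(v, reverse=True), start=1):
        cumulative += u
        if u - (cumulative - 1.0) / i > 0.0:
            theta = (cumulative - 1.0) / i
    return [max(x - theta, 0.0) for x in v]


def convex_weights(train, valid, bandwidths, steps=500):
    """Minimize w'Gw - 2 w'c over the simplex by projected gradient descent."""
    m, M = len(train), len(bandwidths)
    # G[j][k] = integral of p_j * p_k; two Gaussian kernels convolve to a Gaussian.
    G = [[sum(gauss_pdf(a - b, math.hypot(hj, hk)) for a in train for b in train) / m ** 2
          for hk in bandwidths] for hj in bandwidths]
    # c[j] = average of the j-th estimator over the held-out observations.
    c = [sum(kde(x, train, h) for x in valid) / len(valid) for h in bandwidths]
    # The trace of G bounds its largest eigenvalue, so this step size is safe.
    step = 1.0 / (2.0 * sum(G[j][j] for j in range(M)))
    w = [1.0 / M] * M
    for _ in range(steps):
        grad = [2.0 * (sum(G[j][k] * w[k] for k in range(M)) - c[j]) for j in range(M)]
        w = project_to_simplex([wj - step * gj for wj, gj in zip(w, grad)])
    return w


def averaged_aggregate(sample, bandwidths, n_splits=10, seed=0):
    """Average the convex aggregates obtained from random half/half splits."""
    rng = random.Random(seed)
    fits = []
    for _ in range(n_splits):
        perm = list(sample)
        rng.shuffle(perm)
        half = len(perm) // 2
        train, valid = perm[:half], perm[half:]
        fits.append((train, convex_weights(train, valid, bandwidths)))

    def density(x):
        return sum(w * kde(x, train, h)
                   for train, ws in fits for w, h in zip(ws, bandwidths)) / n_splits

    return density, fits


rng = random.Random(1)
sample = [rng.gauss(0.0, 1.0) for _ in range(60)]
density, fits = averaged_aggregate(sample, [0.001, 0.005, 0.01, 0.05, 0.1, 0.5], n_splits=3)
```

Adding a further estimator, such as the block Stein estimator used in AggStein, would only require one more row and column of $G$ and one more entry of $c$; those integrals would then have to be computed numerically.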

We start the simulation study with a comparison of the Monte Carlo mean integrated squared error 
(MISE) of AggPure and AggStein with benchmarks. The MISE has been computed by averaging the 
integrated squared errors of 200 aggregate estimators calculated from different samples of size 50, 100, 150, 200 
and 500. We compared the performance of the convex aggregates and of kernel estimators with common 
data-driven bandwidth selectors and the Gaussian $\mathcal{N}(0,1)$ kernel. The following bandwidth selectors are 
taken from the default package stats of the R software. 

• DPI that implements the direct plug-in method of Sheather and Jones (1991) to select the bandwidth 
using pilot estimation of derivatives. 

• UCV and BCV that implement unbiased and biased cross-validation respectively (see, e.g., Wand 
and Jones (1995)). 

• Nrd0 that implements Silverman's rule of thumb [cf. Silverman (1986), page 48]. It sets the 
bandwidth to 0.9 times the minimum of the standard deviation and the interquartile range 
divided by 1.34, multiplied by the sample size to the negative one-fifth power. 

These descriptions correspond to the function bandwidth in R which also allows for another choice of 
rule of thumb called Nrd. It is a modification of Nrd0 given by Scott (1992), using the factor 1.06 instead of 
0.9. In our case, on the tested densities and sample sizes, this always leads to a MISE greater than that 
of Nrd0 except for the Gaussian density for which it is tailored. For this density, the performance of Nrd 
is presented instead of that of Nrd0. 
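The two rules of thumb differ only in their leading factor, as the following sketch illustrates. It assumes the formula $h = c \cdot \min(\hat\sigma, \mathrm{IQR}/1.34)\, n^{-1/5}$ with $c = 0.9$ for Nrd0 and $c = 1.06$ for Nrd, uses the sample standard deviation and linearly interpolated quartiles, and does not reproduce the fallbacks that R applies when the spread is zero. The function name is ours.

```python
import statistics


def rule_of_thumb_bandwidth(sample, factor=0.9):
    """Bandwidth factor * min(sd, IQR / 1.34) * n^(-1/5).

    factor=0.9 corresponds to the Nrd0 rule and factor=1.06 to the Nrd rule.
    """
    n = len(sample)
    sd = statistics.stdev(sample)  # sample standard deviation (n - 1 denominator)
    q1, _, q3 = statistics.quantiles(sample, n=4, method="inclusive")
    spread = min(sd, (q3 - q1) / 1.34)
    return factor * spread * n ** (-0.2)


if __name__ == "__main__":
    data = [float(i) for i in range(1, 10)]
    print("Nrd0:", rule_of_thumb_bandwidth(data, factor=0.9))
    print("Nrd :", rule_of_thumb_bandwidth(data, factor=1.06))
```

Because both rules share the same spread estimate, the Nrd bandwidth is always $1.06/0.9 \approx 1.18$ times the Nrd0 bandwidth on the same sample.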

The results are reported in Tables 1 to 3, where we also included the MISE of the block Stein density 
estimator described in Rigollet (2006) and the oracle risk, which is defined as the minimum MISE of 
kernel density estimators over the grid $\mathcal{H}$. It is, in general, greater than the convex oracle risk, which is 
why it sometimes slightly exceeds the MISE of convex aggregates or of other estimators that mimic more 
powerful oracles for specific densities (such as DPI or Nrd for the Gaussian density). 





              Gaussian density                      Exponential density
n             50    100    150    200    500        50    100    150    200    500
AggPure    0.020  0.011  0.008  0.006  0.002     0.084  0.057  0.046  0.039  0.025
AggStein   0.017  0.009  0.006  0.005  0.002     0.085  0.057  0.045  0.039  0.025
Stein      0.016  0.010  0.006  0.005  0.003     0.073  0.056  0.046  0.041  0.027
DPI        0.011  0.006  0.005  0.004  0.002     0.075  0.060  0.052  0.045  0.033
UCV        0.015  0.008  0.006  0.005  0.002     0.072  0.052  0.042  0.038  0.023
BCV        0.009  0.006  0.004  0.003  0.002     0.108  0.083  0.070  0.058  0.036
Nrd        0.010  0.006  0.004  0.003  0.002     0.085  0.072  0.067  0.061  0.051
Oracle     0.008  0.005  0.004  0.004  0.003     0.067  0.047  0.039  0.035  0.022

Table 1: MISE for the Gaussian (left) and the exponential (right) densities; n denotes the sample size 



It is well known (see, e.g., Wand and Jones (1995)) that bandwidth selection by cross-validation 
(UCV) is unstable and leads too often to undersmoothing. The DPI and BCV methods were proposed 
              Claw density                          Smooth Comb density
n             50    100    150    200    500        50    100    150    200    500
AggPure    0.058  0.041  0.034  0.029  0.014     0.064  0.042  0.034  0.029  0.017
AggStein   0.056  0.041  0.032  0.025  0.010     0.061  0.042  0.033  0.028  0.017
Stein      0.061  0.035  0.024  0.018  0.009     0.057  0.041  0.033  0.028  0.017
DPI        0.059  0.052  0.050  0.048  0.043     0.070  0.054  0.046  0.042  0.029
UCV        0.063  0.043  0.032  0.026  0.012     0.057  0.038  0.031  0.026  0.016
BCV        0.058  0.052  0.051  0.050  0.046     0.101  0.083  0.066  0.055  0.027
Nrd0       0.058  0.051  0.050  0.048  0.043     0.088  0.078  0.072  0.069  0.057
Oracle     0.058  0.037  0.029  0.025  0.012     0.064  0.038  0.030  0.025  0.016

Table 2: MISE for the Claw (left) and the Smooth Comb (right) densities; n denotes the sample size 





              dens1                                 dens2
n             50    100    150    200    500        50    100    150    200    500
AggPure    0.145  0.125  0.111  0.100  0.067     0.142  0.119  0.102  0.093  0.061
AggStein   0.148  0.124  0.112  0.102  0.067     0.148  0.141  0.103  0.092  0.060
Stein      0.152  0.143  0.140  0.138  0.132     0.154  0.143  0.140  0.137  0.132
DPI        0.149  0.142  0.139  0.137  0.132     0.147  0.140  0.138  0.136  0.132
UCV        0.153  0.148  0.140  0.136  0.116     0.154  0.142  0.133  0.126  0.074
BCV        0.149  0.143  0.140  0.139  0.134     0.146  0.141  0.139  0.138  0.134
Nrd0       0.149  0.141  0.138  0.137  0.133     0.146  0.140  0.137  0.136  0.132
Oracle     0.148  0.144  0.142  0.133  0.067     0.145  0.128  0.109  0.101  0.062

Table 3: MISE for dens1 (left) and dens2 (right); n denotes the sample size 



in order to bypass the problem of undersmoothing. However, sometimes they lead to oversmoothing 
as in the case of the Claw density while convex aggregation works well. For the normal density DPI, 
BCV and Nrd are better, which comes as no surprise since these estimators are designed to estimate 
this density well. For the other densities, which are more difficult to estimate, these data-driven bandwidth 
selectors do not provide good estimators whereas the aggregation procedures remain stable. The block 
Stein estimator performs well in all the cases except for the highly inhomogeneous densities (cf. Table 
3). In conclusion, the estimators AggPure and AggStein are very robust, as compared to other tested 
procedures: they are not far from the best performance for the densities that are easy to estimate and 
they are clear winners for densities with inhomogeneous smoothness for which other procedures fail. 

AggStein is slightly better than AggPure for the Claw density and outperforms the other tested 
estimators in almost all the considered cases, so we studied this procedure in more detail. We focused on 
the Claw and Smooth Comb densities and a sample of size 500. Figure 1 gives a visual comparison of the 
AggStein procedure and the DPI procedure. It illustrates the oversmoothing effect of the DPI procedure 
and the fact that the AggStein procedure adapts to inhomogeneous smoothness. We finally comment on 
two other aspects of the AggStein procedure: 

• the distribution of weights that are allocated to the aggregated estimators, 

• the robustness to the number and size of the splits. 

The boxplots represented in Figure 2 give the distributions of the weights allocated to the 7 estimators to be 
aggregated: the 6 kernel density estimators and the block Stein estimator. The boxplots are constructed 
from 2000 values of the vector of weights (200 samples times 10 splits). We immediately notice that 
for the Claw density a median weight greater than 0.65 is allocated to the block Stein estimator. This can 
be explained by the fact that the block Stein estimator performs better than kernel density estimators 
on this density [cf. MISE of AggPure and Stein in Table 2 (left)], and the AggStein procedure takes 
advantage of it. On the other hand, for the Smooth Comb density, the block Stein estimator does not 
perform significantly better than the kernel density estimators [see Table 2 (right)] and the AggStein 
procedure does not use it at all. For this sample size and this density, the procedures AggStein and 
AggPure are equivalent. 

Figure 1: The Claw and Smooth Comb densities (left panel: the Claw density; right panel: the Smooth Comb density) 

A free parameter of the aggregation procedures is the set of splits. In this study we choose random 
splits and we only have to specify their number and sizes. Obviously, we would like to use fewer 
splits in order to make the procedure less time consuming. Figure 3 shows the sensitivity of the MISE both 
to the number of splits and to the size of the training sample in the case of dens1 and dens2 with the 
overall sample size 200. Two important conclusions are: (i) there exists a size of the training sample 
that achieves the minimum MISE, and (ii) there is essentially nothing to gain by producing more than 20 
splits. Similar results are obtained for AggPure, and they are valid on the whole set of tested densities. 

Acknowledgment: We would like to thank the referees for helpful remarks and Lucien Birgé for 
suggesting an improvement of the constants in Theorem 3.1 as well as a simplification of its proof. We 
refer to Birgé (2006) for comments on a previous version of this paper. 

Figure 2: Boxplots for the Claw and Smooth Comb densities 

Figure 3: Sensitivity to the number of splits for dens1 (left) and dens2 (right) 